Visualisation with ggplot

Etienne Côme

October 29, 2024

Visualisation ?

“Transformation of the symbolic into the geometric”
[McCormick et al. 1987]

“… finding the artificial memory that best supports our natural means of perception.”
[Bertin 1967]

“The use of computer-generated, interactive, visual representations of data to amplify cognition.”
[Card, Mackinlay, & Shneiderman 1999]

Why visualize?

Integrating the human in the loop

  • Answering questions, or finding questions?
  • Making decisions
  • Putting data in context
  • Amplifying memory
  • Graphical computation
  • Finding schemas and patterns
  • Presenting arguments

Why visualize?

Analyze :

  • Developing and criticizing hypotheses
  • Discovering errors
  • Finding patterns

Communicate

  • Sharing and convincing
  • Collaborating and reviewing

Anscombe quartet

g  mean_x  mean_y    sd_x      sd_y
1  9       7.500909  3.316625  2.031568
2  9       7.500909  3.316625  2.031657
3  9       7.500000  3.316625  2.030424
4  9       7.500909  3.316625  2.030578
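
The summary table above can be recomputed from the `anscombe` data frame that ships with base R. This is a minimal sketch: the long layout with a group column `g` and the helper names are illustrative choices, not necessarily how the slide was produced.

```r
# Anscombe's quartet ships with base R as a wide data frame (x1..x4, y1..y4).
data(anscombe)

# Stack the four x/y pairs into one long table with a group id g.
long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(g = i,
             x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
}))

# Per-group means and standard deviations: nearly identical across groups,
# although the four scatter plots look completely different.
summary_tab <- do.call(rbind, lapply(split(long, long$g), function(d) {
  data.frame(g = d$g[1],
             mean_x = mean(d$x), mean_y = mean(d$y),
             sd_x = sd(d$x), sd_y = sd(d$y))
}))
print(summary_tab)
```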

Cholera map (John Snow)

Visualization

=

encode the data using

visual channels

Visual channels

Bertin Jacques, Sémiologie graphique, Paris, Mouton/Gauthier-Villars, 1967.

Marks // visual channels

Marks :

graphical building blocks

Visual channels :

the visual properties that vary

Marks, visual channels

Not all channels are equal

Marks, visual channels

The best channels depend on the feature type (continuous, categorical, ordinal,…)

Marks, visual channels

The interesting part is not already available

pre-attentive processing

How many 3s ?

1281768756138976546984506985604982826762 9809858458224509856458945098450980943585 9091030209905959595772564675050678904567 8845789809821677654876364908560912949686

pre-attentive processing

library(ggplot2)
ggplot(mpg)+geom_point(aes(x=cty,y=hwy,color=class))

Questions ? Feature types ?

continuous ? discrete ? ordinal ? temporal ? spatial ?

Some categories

and

one quantity for each modality

The bar chart

library(rjson)
library(dplyr)
?mpg

dataset mpg

  • manufacturer
  • model
  • displ: engine displacement, in litres

The bar chart

m_cty = mpg %>% group_by(manufacturer) %>% summarize(mcty=mean(cty))
ggplot(data=m_cty)+
  geom_bar(aes(x=manufacturer,y=mcty),stat = 'identity')+
  scale_x_discrete("Manufacturer")+
  scale_y_continuous("Miles / Gallon (City conditions)")

Order ?

m_cty_ordered = m_cty %>% arrange(desc(mcty)) %>% 
  mutate(manufacturer=factor(manufacturer,levels=manufacturer))
ggplot(data=m_cty_ordered)+
  geom_bar(aes(x=manufacturer,y=mcty),stat = 'identity')+
  scale_x_discrete("Manufacturer")+
  scale_y_continuous("Miles / Gallon (City conditions)")

Horizontal ?

ggplot(data=m_cty_ordered)+
  geom_bar(aes(x=manufacturer,y=mcty),stat = 'identity')+
  scale_x_discrete("Manufacturer")+
  scale_y_continuous("Miles / Gallon (City conditions)")+
  coord_flip()
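
As an alternative to building the ordered factor by hand, the ordering can be stated directly in the aesthetic mapping with base R's `reorder()`. This is a small sketch, assuming `ggplot2` is installed; the names `m_cty2` and `p_sorted` are illustrative.

```r
library(ggplot2)

# Mean city mileage per manufacturer, computed with base R.
m_cty2 <- aggregate(cty ~ manufacturer, data = mpg, FUN = mean)
names(m_cty2)[2] <- "mcty"

# reorder() sorts the factor levels by mcty; the minus sign gives descending order.
p_sorted <- ggplot(m_cty2) +
  geom_col(aes(x = reorder(manufacturer, -mcty), y = mcty)) +
  scale_x_discrete("Manufacturer") +
  scale_y_continuous("Miles / Gallon (City conditions)") +
  coord_flip()
p_sorted
```

`geom_col()` is equivalent to `geom_bar(stat = 'identity')`: the bar height is read from the data instead of being a count.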

The line chart :

1 numeric variable

with respect

to time

Vélib’ data :

url="./data/sp_Lyon.json"
library(dplyr)
# read some data
data=fromJSON(file=url)
# to data.frame
extract = function(x){
  data.frame(id=x$'_id',
             time= x$download_date,
             nbbikes = x$available_bikes )
  }
st_tempstats.df=do.call(rbind,lapply(data,extract))
tempstats.df=st_tempstats.df |> group_by(time) |> summarise(nbbikes = sum(nbbikes))

Time, natural order

ggplot(data=tempstats.df,aes(x=time,y=nbbikes))+geom_point()

Time, natural order

ggplot(data=tempstats.df,aes(x=time,y=nbbikes))+geom_line()

Aspect ratio

ggplot(data=tempstats.df,aes(x=time,y=nbbikes))+geom_line()

Aspect ratio, 45°

Heuristic: use the aspect ratio that results in an average line slope of 45°.

Cleveland, William S., Marylyn E. McGill, and Robert McGill. “The shape parameter of a two-variable graph.” Journal of the American Statistical Association 83.402 (1988): 289-300.
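
One simple way to apply this heuristic is the median-absolute-slope variant of banking: pick the height/width ratio for which the median segment is drawn at 45°. The sketch below is an assumption-laden illustration, not the exact procedure of the paper: the helper `bank_to_45()` is hypothetical, and a synthetic daily cycle stands in for the bike counts.

```r
library(ggplot2)

# Median-absolute-slope banking: choose height/width so that the median
# segment, once drawn, makes a 45-degree angle with the horizontal.
bank_to_45 <- function(x, y) {
  dx <- diff(x)
  dy <- diff(y)
  keep <- dx != 0 & dy != 0           # ignore vertical and flat segments
  slopes <- abs(dy[keep] / dx[keep])  # slopes in data units
  rx <- diff(range(x))
  ry <- diff(range(y))
  # A data slope s is drawn with visual slope s * (rx / ry) * (height / width);
  # setting the median visual slope to 1 gives the ratio below.
  ry / (rx * median(slopes))
}

# Synthetic two-day series with a 24-hour cycle (illustrative data).
hours <- seq(0, 48, by = 0.5)
bikes <- 2000 + 500 * sin(2 * pi * hours / 24)
ratio <- bank_to_45(hours, bikes)

# theme(aspect.ratio = ...) fixes the height/width ratio of the panel.
p_banked <- ggplot(data.frame(hours, bikes), aes(x = hours, y = bikes)) +
  geom_line() +
  theme(aspect.ratio = ratio)
p_banked
```

With a date-time x axis, convert the times to numbers (for example with `as.numeric()`) before calling the helper.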

Area + Scale

ggplot(data=tempstats.df,aes(x=time,y=nbbikes))+geom_area()

Point of view

ggplot(data=tempstats.df,aes(x=time,y=max(nbbikes)-nbbikes))+
  geom_area()

1 numeric variable

with respect

to time

+ categories

Vélib’ data per station

# read data and pre-processing
url = "./data/sp_Lyon.json"
data=fromJSON(file=url)
extract = function(x){
  data.frame(id=x$'_id',
             time= x$download_date,
             nbbikes = x$available_bikes )
  }
st_tempstats.df=do.call(rbind,lapply(data,extract))
sel = st_tempstats.df %>% select(id) %>% unique() %>% sample_n(8) %>% pull()
# select a few stations
st_tempstats_sub.df = st_tempstats.df %>% 
  filter(id %in% sel)

Multiple line charts

ggplot(data=st_tempstats_sub.df)+
  geom_line(aes(x=time,y=nbbikes,group=id,color=factor(id)),linewidth=2)

Small multiples

ggplot(data=st_tempstats_sub.df)+
  geom_line(aes(x=time,y=nbbikes,group=id,color=factor(id)),linewidth=2)+
  facet_grid(id ~ .)

2 numeric features

+ categories

Scatter plot + colors

mpg_su = mpg %>% 
  filter(class %in% c('compact','suv','pickup','minivan')) 
ggplot(mpg_su)+geom_point(aes(x=cty,y=hwy,color=class))

Scatter plot + symbols

mpg_su = mpg %>% 
  filter(class %in% c('compact','suv','pickup','minivan')) 
ggplot(mpg_su)+geom_point(aes(x=cty,y=hwy,shape=class))

3 numeric features (with one >0)

+ categories

Scatter plot + color + size

ggplot(mpg_su)+geom_point(aes(x=cty,y=hwy,color=class,size=displ))

Scatter plot + color + size ! scales

ggplot(mpg_su)+geom_point(aes(x=cty,y=hwy,color=class,size=displ))

Circle size : radius or area ?

Radius

Area
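
In ggplot2 both encodings are available as scales, so the choice can be made explicit. This is a sketch, assuming `ggplot2` is installed; the object names and the size ranges are illustrative.

```r
library(ggplot2)

mpg_sub <- subset(mpg, class %in% c("compact", "suv", "pickup", "minivan"))
base_plot <- ggplot(mpg_sub) +
  geom_point(aes(x = cty, y = hwy, color = class, size = displ), alpha = 0.6)

# Area proportional to displ, with a value of 0 mapped to an area of 0.
p_area <- base_plot + scale_size_area(max_size = 6)

# Radius varies linearly with displ: large values look disproportionately big.
p_radius <- base_plot + scale_radius(range = c(1, 6))

# Why it matters: doubling the radius multiplies the disc area by four.
disc_area <- function(r) pi * r^2
disc_area(2) / disc_area(1)
```

Mapping the value to the area is usually the safer default, since readers tend to compare the surfaces of the discs rather than their radii.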

More complex graphics

https://www.data-to-viz.com/

ex: More than 3 continuous variables ?

More complex graphics

https://www.data-to-viz.com/

ex: Multimodal distributions ?

Principle :

\[\textrm{Lie factor} = \frac{\textrm{visual effect size}}{\textrm{data effect size}}\]

Lie factor :

\[\textrm{data effect size} = \frac{27.5 - 18}{18} \times 100 = 53 \%\]

\[\textrm{visual effect size} = \frac{5.3 - 0.6}{0.6} \times 100 = 783 \%\]

\[\textrm{Lie factor} = \frac{783}{53} = 14.8\]

Edward Tufte, The Visual Display of Quantitative Information, Cheshire, CT, Graphics Press, 2001, 2nd ed. (1st ed. 1983)

Lie factor : 9.4

Edward Tufte, The Visual Display of Quantitative Information, Cheshire, CT, Graphics Press, 2001, 2nd ed. (1st ed. 1983)

Knowing that the “apple” area (in green) is equal to \(2.22\,cm^2\) and that the rim area (in blue) is equal to \(2.96\,cm^2\), compute the Lie factor.
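
The Lie factor computation worked through above can be wrapped in a small helper. This is a sketch: the function names `effect_size()` and `lie_factor()` are illustrative, and the numbers are those of the worked example (data from 18 to 27.5, drawn lengths from 0.6 to 5.3).

```r
# Relative change, in percent, between a first and a second value.
effect_size <- function(first, second) (second - first) / first * 100

# Lie factor = visual effect size / data effect size.
lie_factor <- function(data_first, data_second, visual_first, visual_second) {
  effect_size(visual_first, visual_second) / effect_size(data_first, data_second)
}

# Worked example: roughly 14.8, i.e. the graphic exaggerates the change
# by more than an order of magnitude.
lf <- lie_factor(18, 27.5, 0.6, 5.3)
round(lf, 1)
```

A faithful graphic has a Lie factor close to 1.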

Perception

\[S = I^p\]
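
The formula above is Stevens' power law: the perceived sensation S grows as a power p of the stimulus intensity I, and p depends on the visual channel. The sketch below shows the consequence for size encodings. The exponents are illustrative assumptions (about 1 for length, around 0.7 is often reported for area), not values taken from these slides.

```r
# Perceived magnitude under Stevens' power law.
perceived <- function(intensity, p) intensity^p

stimulus_ratio <- 4  # the mark is made four times larger

# Length-like channel (p = 1): the change is perceived faithfully.
perceived(stimulus_ratio, p = 1.0)

# Area-like channel (p = 0.7, assumed): a fourfold area reads as
# only about 2.6 times larger.
perceived(stimulus_ratio, p = 0.7)
```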

Principle :

Increase the data density

\[\textrm{graph data density} = \frac{\textrm{number of entries in data matrix}}{\textrm{area of data display}} \]

Data density :

Avoid graphics with low data density

Edward Tufte, The Visual Display of Quantitative Information, Cheshire, CT, Graphics Press, 2001, 2nd ed. (1st ed. 1983)

Principle :

Increase the data-ink ratio

\[\textrm{data-ink ratio} = \frac{\textrm{area of data-ink}}{\textrm{total area of ink}}\]

Data-ink ratio :

Remove to improve

https://speakerdeck.com/cherdarchuk/remove-to-improve-the-data-ink-ratio

Data-ink ratio :

Remove to improve

https://www.youtube.com/watch?v=bDbJBWvonVI
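
In ggplot2, most of the non-data ink is controlled by the theme. The sketch below contrasts the default look with a leaner one. It assumes `ggplot2` is installed; the object names and the choice of which elements to drop are illustrative, not a rule.

```r
library(ggplot2)

# Mean city mileage per manufacturer (base R aggregate).
m_cty_ink <- aggregate(cty ~ manufacturer, data = mpg, FUN = mean)

# Default theme: grey panel background plus major and minor grid lines.
p_default <- ggplot(m_cty_ink) +
  geom_col(aes(x = manufacturer, y = cty)) +
  labs(x = "Manufacturer", y = "Miles / Gallon (City conditions)")

# Leaner version: white background, no minor grid, no vertical grid lines
# (the bars already mark the categories).
p_lean <- p_default +
  theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank())
p_lean
```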

Recap

  • Avoid misleading graphics !
  • Avoid empty graphics
  • Be parsimonious with ink
  • Mind the scales (colors, sizes)
  • Use explicit labels and
  • Mind categorical features and their order
  • Mind the aspect ratio
  • File type: pdf, svg (vector) // png, jpg (raster)

ggplot

gg = grammar of graphics

  • “The Grammar of Graphics” (Wilkinson, Annand and Grossman, 2005)
  • grammar → same language for all figures

ggplot

building blocks of the grammar

  • the coordinate system
  • data and aesthetic mappings,
    ex : f(data) → x position, y position, size, shape, color
  • geometric objects,
    ex : points, lines, bars, texts
  • scales,
    ex : f([0, 100]) → [0, 5] px
  • facet specification,
    ex : split the data into several plots
  • statistical transformations,
    ex : average, counting, regression
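
The building blocks listed above can all appear in a single call. The sketch below uses the `mpg` data and tags each line with the block it illustrates; the particular choices (linear smoother, facet by year, y limits) are illustrative.

```r
library(ggplot2)

p_blocks <- ggplot(mpg, aes(x = displ, y = hwy)) +            # data + aesthetic mappings
  geom_point(aes(color = class)) +                            # geometric object
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +   # statistical transformation
  scale_x_continuous("Engine displacement (litres)") +        # scale
  facet_wrap(~ year) +                                        # facet specification
  coord_cartesian(ylim = c(10, 45))                           # coordinate system
p_blocks
```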

ggplot

Make a graphic :

  • add several layers
  • with their own visual encoding and possibly their own data
  • (+ optional) add statistical transformation
  • (+ optional) change scale options
  • (+ optional) specify title, theme, guides, style …


! data = tidy data.frame with the right feature types
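
Feature types matter because ggplot2 chooses the default scale from the column type. A minimal sketch with the `mpg` data, where `cyl` is stored as an integer; the name `mpg_typed` is illustrative.

```r
library(ggplot2)

# cyl is an integer column, so it would get a continuous color gradient.
class(mpg$cyl)

# Declaring it as a factor gives one distinct color per number of cylinders.
mpg_typed <- mpg
mpg_typed$cyl <- factor(mpg_typed$cyl)

p_typed <- ggplot(mpg_typed, aes(x = cty, y = hwy, color = cyl)) +
  geom_point()
p_typed
```

The same applies to dates and times: a time column read as text should be converted (for example with `as.POSIXct()`) before being mapped to an axis.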

ggplot, geometries

Make a graphic :

  • add several layers
    +geom_line()
  • with their own visual encoding and possibly their own data
    aes(x=a,y=b,...)

Example


ggplot(mpg)+
  geom_point(aes(x=cty,y=hwy,color=manufacturer,shape=factor(cyl)))
ggplot(mpg,aes(x=cty,y=hwy,color=manufacturer,shape=factor(cyl)))+
  geom_jitter()

ggplot

ggplot(mpg,aes(x=cty,y=hwy,color=class))+geom_point()

ggplot

ggplot(mpg,aes(x=cty,y=hwy,color=class))+geom_jitter()

ggplot

ggplot(mpg,aes(x=cty,fill=class))+geom_histogram(binwidth=2)

ggplot

ggplot(mpg,aes(y=cty,x=class))+geom_violin()

ggplot, scales

Make a graphic :

  • add several layers
    +geom_line()
  • with their own visual encoding and possibly their own data
    aes(x=a,y=b,...)
  • (+ optional) change scale options
    scale_fill_brewer(palette=3,type="qual")
    scale_x_continuous(limits=c(0,45),breaks=seq(0,45,2))

ggplot, scales

ggplot(mpg,aes(x=cty,y=hwy,color=manufacturer,shape=factor(cyl)))+
  geom_jitter()+
  scale_x_continuous(limits=c(0,45),breaks=seq(0,45,2))

Colors

scales

Color scales

http://colorbrewer2.org/
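
The ColorBrewer palettes are available in ggplot2 through the brewer and distiller scales. This is a sketch, assuming `ggplot2` is installed; the palette names "Set2" and "Blues" are ColorBrewer palettes chosen for illustration.

```r
library(ggplot2)

p_base <- ggplot(mpg, aes(x = cty, y = hwy))

# Qualitative palette for an unordered categorical feature.
p_qual <- p_base +
  geom_point(aes(color = drv)) +
  scale_color_brewer(palette = "Set2")

# Sequential palette for a continuous feature (light = low, dark = high).
p_seq <- p_base +
  geom_point(aes(color = displ)) +
  scale_color_distiller(palette = "Blues", direction = 1)
```

The palette family should follow the feature type: qualitative for categories, sequential for ordered or continuous values, diverging when there is a meaningful midpoint.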

ggplot, faceting

Make a graphic :

  • add several layers
    +geom_line()
  • with their own visual encoding and possibly their own data
    aes(x=a,y=b,...)
  • (+ optional) change scale options
    scale_fill_brewer(palette=3,type="qual")
    scale_x_continuous(limits=c(0,45),breaks=seq(0,45,2))
  • use facet ?
    facet_grid(. ~ cyl)

ggplot, faceting

ggplot(data=mpg,aes(x=hwy,y=cty,color=class))+
  geom_point()+
  facet_wrap(~year)

ggplot, stats

Make a graphic :

  • add several layers
    +geom_line()
  • with their own visual encoding and possibly their own data
    aes(x=a,y=b,...)
  • (+ optional) change scale options
    scale_fill_brewer(palette=3,type="qual")
    scale_x_continuous(limits=c(0,45),breaks=seq(0,45,2))
  • add statistics
    stat_density2d()

ggplot

ggplot(mpg,aes(y=cty,x=hwy))+
  geom_point(color="blue")+stat_density2d()

ggplot

ggplot(mpg,aes(y=cty,x=hwy))+
  geom_point(color="blue")+stat_smooth()

ggplot

library(hexbin)
ggplot(mpg,aes(y=cty,x=hwy))+
  stat_binhex()

Exercises

Update the scale and labels

# load and reshape the data
url = "./data/sp_Lyon.json"
data=fromJSON(file=url)
extract = function(x){
  data.frame(id=x$'_id',
             time= x$download_date,
             nbbikes = x$available_bikes )
  }
st_tempstats.df=do.call(rbind,lapply(data,extract))
# select 3 stations
st_tempstats_sub.df = st_tempstats.df %>% 
  filter(id %in% sel)
ggplot(data=st_tempstats_sub.df)+
  geom_line(aes(x=time,y=nbbikes,group=id,color=factor(id)),linewidth=2)+
  facet_grid(id ~ .)

Exercises

Reproduce this graphic (Iris data)

Exercises

Reproduce this graphic (mtcars data). Hint: change the plot theme, see ?theme

Exercises

Reproduce this graphic

Exercises

Reproduce this graphic

Information :
  • Bike sharing data from Lyon (data folder)
  • Compute the occupancy rate: nb bikes / max(nb bikes)
  • Pivot to wide format
  • Run a k-means with 8 clusters on X (rows = stations, columns = time slots)
  • Facet + mean curve + alpha blending